Explanation reporting based on differentiation between items in different data groups

ABSTRACT

In an embodiment, a process for explanation reporting based on differentiation between items in different data groups includes obtaining model scores from a first machine learning model and training a second machine learning model to learn how to differentiate between two groups based on at least one of: features and the model scores obtained from the first machine learning model. The process includes applying the second machine learning model to each data record in a first group of data records to determine a corresponding ranking score for each data record in the first group, and based on the corresponding ranking scores, determining a relative contribution of each of the data records in the first group to the differentiation between the first group of data records and a second group of data records.

CROSS REFERENCE TO OTHER APPLICATIONS

This application claims priority to U.S. Provisional Patent Application No. 62/847,101 entitled AUTOMATIC MODEL MONITORING FOR DATA STREAMS filed May 13, 2019 which is incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Sensitive data such as credit card numbers are increasingly being exchanged over the Internet with the evolution in point of sale systems as well as increasing popularity of online shops. Electronic security measures analyze transactional data to detect a security breach. The analysis of the transactional data includes classifying and interpreting the data. For example, a machine learning model is deployed into a data streaming scenario and the model is monitored to detect anomalous events or sudden changes in behavior.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1A shows an example of an input data stream.

FIG. 1B shows an example of scores output by a machine learning model using the input data stream of FIG. 1A.

FIG. 1C shows a signal and a threshold generated by automatic model monitoring according to an embodiment of the present disclosure.

FIG. 2 is a flow chart illustrating an embodiment of a process for automatic model monitoring for data streams.

FIG. 3A shows an example of a target window and a reference window for an input data stream at a first point in time according to an embodiment of the present disclosure.

FIG. 3B shows an example of a target window and a reference window for an input data stream at a second point in time according to an embodiment of the present disclosure.

FIG. 3C shows an example of a target window and a reference window for an input data stream at a third point in time according to an embodiment of the present disclosure.

FIG. 4A shows an example of fixed-size contiguous windows according to an embodiment of the present disclosure.

FIG. 4B shows an example of time-based contiguous windows according to an embodiment of the present disclosure.

FIG. 4C shows an example of homologous windows according to an embodiment of the present disclosure.

FIG. 4D shows an example of homologous windows according to an embodiment of the present disclosure.

FIG. 5 is a flow chart illustrating an embodiment of a process for adaptive threshold estimation for streaming data.

FIG. 6 is a flow chart illustrating an embodiment of a process for redistributing positions for a set of percentile bins.

FIG. 7 shows an example of bins that are processed using an adaptive streaming percentiles estimator according to an embodiment of the present disclosure.

FIG. 8 shows an example of the effects of various exponential moving (EM) average weights.

FIG. 9 is a flow chart illustrating an embodiment of a process for explanation reporting based on differentiation between items in different data groups.

FIG. 10 is a flow chart illustrating an embodiment of a process for removing time correlated features in a data set.

FIG. 11 shows an example of an explanation report according to an embodiment of the present disclosure.

FIG. 12 is a block diagram illustrating an embodiment of a system in which automatic model monitoring for data streams can be implemented.

FIG. 13 is a functional diagram illustrating a programmed computer system for automatic model monitoring in accordance with some embodiments.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Model monitoring refers to monitoring machine learning models in production environments such as an environment that determines whether a fraud or security attack is happening by observing data streams of transactions. Data streams tend to change frequently and quickly in a non-stationary way. A model may misbehave because the attack pattern was not seen when the model was trained, a user does not collect certain fields expected by an API, or other engineering issues. A spike in transactions can be caused by a popular sale item, a fraud attack, or a data issue, among other things. A model may be made less strict to reduce false alarms in the case of popular sale items because these are legitimate transactions. A model may be made stricter to block more fraud attempts. To address a data issue such as an API change that makes data fields unavailable, the system platform may be updated. An example of a system for preventing fraud attacks is shown in FIG. 12. The examples here are fraud attacks but this is not intended to be limiting, and the disclosed techniques can be applied to other types of streaming data.

In an example setup, an application uses more than one machine learning model (sometimes simply called “model”), several machines with different environments, and receives data from several types of devices in different geographical locations. This relatively wide scope for unexpected behavior or sudden changes (i.e., concept drift) makes model monitoring challenging, especially if performed manually.

Concept drift is a change, over time, in the relation between the data collected to perform a classification task (to produce an interpretation of the data) and the corresponding true label collected for that data. Conventional automated methods of detecting concept drift require labels (which are often determined by an analyst) in order to accurately measure model performance. Conventional methods use the loss of the predictive model (e.g., cross entropy loss) to detect concept drift. Thus, if the labels are not immediately available after prediction, problems are detected too late. In other words, conventional methods typically cannot detect concept drift when labels are unavailable. In many domains, labels are often collected with several weeks of delay, making conventional methods impractical for many streaming data applications.

In addition, conventional systems typically do not identify possible causes for concept drift. A fraud detection model in online payments could show a drift due to a popular sale item (with an increase in false positives) or due to a true fraud attack (with an increase in false negatives). Conventional model monitoring methods cannot detect or explain changes (concept drifts) before labels are available.

Automatic model monitoring for data streams is disclosed. The automatic model monitoring system detects changes in data streams (i.e., concept drift) using a time- and space-efficient unsupervised process. The disclosed model monitoring techniques can detect changes in behavior occurring in a relatively short time scale such as a few hours to a few days without needing labels. In an embodiment, a model monitoring process uses a stream of scores produced by a machine learning model to detect local changes in their distribution. An adaptive threshold is determined and applied to monitoring values calculated from the model scores to detect anomalous behavior. Monitoring values are sometimes collectively referred to as a signal here (e.g., the signal shown in FIG. 1C is made up of monitoring values). The automatic model monitoring system can explain the changes in behavior. For example, an explanation report is generated to explain the causes of the change such as a summary of events/data records and features explaining the change.

The following figures show an example of how the disclosed automatic model monitoring techniques perform a classification task. In particular, FIGS. 1A-1C show an example of a binary classification task in which anomalies are positively identified.

FIG. 1A shows an example of an input data stream. The plot shows the data stream over time. The events/data records represented by white circles correspond to normal behavior and the events/data records (groups 102 and 104) represented by black circles correspond to anomalous behavior. In this example, there are two bot attacks in a fraud detection scenario: a first attack at 102 and a second attack at 104.

FIG. 1B shows an example of scores output by a machine learning model using the input data stream of FIG. 1A. The plot shows a time series of model scores produced by a machine learning model in response to input data of FIG. 1A. The model is not able to detect the first attack 102 because the risk scores are low. The model is able to detect the second attack 104 because the risk scores are high.

FIG. 1C shows a signal and a threshold generated by automatic model monitoring according to an embodiment of the present disclosure. The plot shows a signal (solid line) and an adaptive threshold (dashed line). The signal captures the score distribution of FIG. 1B. The signal provides a measure of similarity between the model scores distribution in a target window T (most recent events) and in a reference window R (older events). Examples of a target window and reference window are further described with respect to FIGS. 3A-3C.

The signal evolves over time as model scores corresponding to the data stream change. If the signal is larger than the threshold, an alarm is triggered as further described with respect to FIG. 2. Unlike the model in FIG. 1B, which detects an attack for scores 104 but not scores 102, the automatic model monitoring in FIG. 1C detects both attacks because the signal at A and the signal at B exceed the threshold. In various embodiments, when an alarm is triggered, a process for determining an explanation is performed as further described with respect to FIG. 9. For example, the explanation is determined by training a machine learning model to find a pattern that distinguishes events in a target window T from the events in a reference window R. The output score and the feature importance of that machine learning model are then used to summarize the characteristics of the alarm.

First, techniques for determining a signal by automatic model monitoring are described (FIGS. 2-4D). Next, techniques for determining an adaptive threshold are described (FIGS. 5-8). Finally, techniques for explanation reporting based on dissimilarities are described (FIGS. 9-11). FIG. 12 shows an example of a system for fraud detection in which the disclosed techniques can be applied.

FIG. 2 is a flow chart illustrating an embodiment of a process for automatic model monitoring for data streams. The process can be performed by a device such as node 1242.1 or 1242.2 of cluster 1240 (alone or in cooperation) or by a processor such as the one shown in FIG. 13.

The process begins by receiving an input dataset (200). In various embodiments, the input dataset includes events/data records in a stream of data. The input data may be received and processed in real time or near real time. For example, events representing financial transactions are received one-by-one as orders for drinks come in from a coffee shop merchant. As another example, the input data is received from a credit card issuer wishing to verify whether transactions are fraudulent. An example of how data is collected by transaction devices and becomes input data to this process is shown in FIG. 12. Referring to FIG. 4A, which shows streaming data made up of events (the black circles), the input dataset includes events in target window T. The process of FIG. 2 can be repeated on new events that are received as data streams in. For example, as new events are received, the target window slides to the right so that in a subsequent iteration of the process, the input dataset includes the new events as further described below.

The process uses a machine learning model to determine a model score for each data record of at least a portion of the input dataset (202). A trained machine learning model takes the data as input and outputs a model score. A variety of machine learning models or other scoring methods can be used. Examples include (but are not limited to) random forests, gradient boosting models, neural networks, logistic regression, and support vector machines. Examples of model scores are shown in FIG. 1B. For each data record (circle in FIG. 1A), a machine learning model determines a corresponding model score (bar in FIG. 1B).

Returning to FIG. 2, the process determines monitoring values (204). Each monitoring value is associated with a measure of similarity between model scores for those data records of the input dataset within a corresponding moving reference window and model scores for those data records of the input dataset within a corresponding moving target window. For example, a monitoring value is a measure of similarity between a model scores histogram in a reference window R and a model scores histogram in a target window T. The T window contains the most recent events collected. The R window contains events in a reference period prior to the target window T. The reference and target window sizes can be fixed-size (predetermined number of events) or fixed-time (predetermined time duration). The windows can be contiguous or homologous, as further described with respect to FIGS. 4A-4C. An example of monitoring values is the signal shown in FIG. 1C, which is made up of a series of monitoring values.

The similarity between a model scores histogram in the reference window R and a model scores histogram in the target window T can be measured using a metric. One type of similarity metric is the Jensen-Shannon divergence. The Jensen-Shannon divergence measures mutual information between the random variable generated by a binary mixture model of the two distributions and the corresponding binary indicator variable. The Jensen-Shannon divergence is bounded and symmetric. When the distributions are the same, the measure goes to zero. When distributions have disjoint domains, the measure goes to log 2 (or 1 if entropy is measured in Shannon units). In addition to binary classification, the Jensen-Shannon divergence is also suitable for multi-dimensional distributions to compute the signal (monitoring values) in multi-class model monitoring use cases. The Jensen-Shannon divergence is an attractive similarity measure because it is stable, less noisy, and sensitive to relative magnitude. Other types of similarity metrics include the Kolmogorov-Smirnov, Kuiper, and Anderson-Darling test statistics. Any of these metrics or other metrics can be used to determine the similarity between the histograms.

The monitoring value can be calculated in a variety of ways using a similarity metric. Given a similarity metric, the corresponding monitoring value is calculated by applying an estimation procedure. By way of non-limiting example, the Jensen-Shannon divergence can be estimated by summing individual divergence contributions for each bin (comparing each bin in the histogram of model scores of the target window T with the same corresponding bin in the histogram of model scores of the reference window R). Other estimation procedures can be used for a given metric.
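By way of non-limiting illustration, the per-bin estimation described above can be sketched in Python as follows. The function names histogram and js_divergence, the choice of ten equal-width bins, and the assumption that model scores lie in the interval [0, 1] are illustrative assumptions and are not taken from the disclosure. Base-2 logarithms are used so that the result is bounded by 1.

```python
import math

def histogram(scores, n_bins=10):
    """Count scores (assumed to lie in [0, 1]) into n_bins equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        counts[idx] += 1
    return counts

def js_divergence(ref_counts, tgt_counts):
    """Jensen-Shannon divergence (base 2), summed bin by bin.

    Both histograms must use the same bins and be non-empty.
    """
    n_ref, n_tgt = sum(ref_counts), sum(tgt_counts)
    total = 0.0
    for r, t in zip(ref_counts, tgt_counts):
        p, q = r / n_ref, t / n_tgt   # normalized bin frequencies
        m = 0.5 * (p + q)             # frequency under the equal-weight mixture
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total

reference_scores = [0.05, 0.10, 0.12, 0.20]
target_scores = [0.80, 0.85, 0.90, 0.95]
monitoring_value = js_divergence(histogram(reference_scores), histogram(target_scores))
```

In this sketch, identical histograms yield a value of zero and histograms with no overlapping bins yield a value of one, consistent with the bounds noted above.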

The process outputs the determined monitoring values (206). In various embodiments, the monitoring values are output by rendering the monitoring values on a graphical user interface. FIG. 1C shows an example of a signal made up of monitoring values plotted alongside a threshold. Another way the monitoring values can be output is outputting the monitoring values for further processing. In some embodiments, the process terminates after performing 206. In some embodiments, the process (optionally) proceeds by comparing the monitoring value(s) to a threshold and providing an indication that the monitoring value(s) meets/exceeds a threshold as follows.

The process detects that at least one of the monitoring values meets a threshold (208). When a monitoring value exceeds the threshold, a number of responses are possible. For example, the process triggers an alarm and the generation of an explanation report. As another example, the process blocks the attack (e.g., bot attack) and reports the attack to an administrator. As yet another example, the process reports that an attack happened and provides an explanation report listing transactions that may have been fraudulent. The threshold can be determined by applying an adaptive streaming percentiles estimator, an example of which is shown in FIG. 5. In some embodiments, if the monitoring value does not meet the threshold, the process continues processing transactions in a streaming fashion until the next monitoring value meets a threshold or until the data stream terminates.

The process provides an indication associated with the detection in response to the detection that at least one of the monitoring values meets the threshold (210). An indication (such as an alarm) is a notification of a change in behavior as indicated by the monitoring value meeting or exceeding a threshold. In some embodiments, a single indication is provided. In other embodiments, multiple indications are provided. For example, the process generates a first indication when a monitoring value has met the threshold and is rising. Later, the process generates a second indication when the monitoring value stops rising. This indicates a peak in the signal (monitoring values). When the process generates a single indication, it can output either the first indication (when monitoring values are rising) or the second indication (when the monitoring value is at a peak). An example of an indication is further described with respect to FIG. 9.
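By way of non-limiting illustration, the two kinds of indication described above can be sketched in Python as follows. The function name indications, the labels "rising" and "peak", and the use of a single constant threshold (rather than an adaptive one) are simplifying assumptions made only for this sketch.

```python
def indications(signal, threshold):
    """Return (index, kind) pairs for a list of monitoring values.

    "rising" marks the first value that meets the threshold; "peak" marks
    the last value before the signal stops rising during that alarm.
    """
    events = []
    in_alarm = False
    peak_reported = False
    for i, value in enumerate(signal):
        if value >= threshold:
            if not in_alarm:
                in_alarm, peak_reported = True, False
                events.append((i, "rising"))
            elif not peak_reported and value <= signal[i - 1]:
                events.append((i - 1, "peak"))
                peak_reported = True
        else:
            if in_alarm and not peak_reported:
                # The signal fell below the threshold right after its maximum
                events.append((i - 1, "peak"))
            in_alarm = False
    return events

alarms = indications([0.1, 0.2, 0.5, 0.7, 0.9, 0.8, 0.6, 0.2], threshold=0.4)
```

A process that provides a single indication could keep only the "rising" entries or only the "peak" entries from the returned list.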

In some embodiments, the process terminates after 206 (or 208) when there are no more new data records. In some embodiments, additional iterations of the process can be performed by returning to 200 to receive new data records after 206 (or after 208 if the monitoring value(s) do not meet the threshold, or after 210). For example, as time progresses new events may be collected in a data stream so returning to 200 means another iteration of the process is performed to process the new events/data records that have come in. In some embodiments, the process is performed in a single iteration on a complete data set (after all events in a data stream have been collected) such as when testing the process or analyzing data not in real time.

The process will now be described using the example windows shown in FIGS. 3A-3C. In this example, the input data stream represents orders at a coffee shop. Each dot represents an event/data record, namely an order for a drink at the coffee shop. As shown in the input data stream, there are more orders each day in the morning around 6:00 and at noon. Since this is expected behavior (people tend to order more coffee in the early morning and at noon), the spike of activity is not fraud.

FIG. 3A shows an example of a target window and a reference window for an input data stream at a first point in time according to an embodiment of the present disclosure. The windows can be used to calculate a monitoring value at 204 of FIG. 2.

In FIG. 3A, events are received in a stream of data and the current time is Monday at 0:00. The oldest events are at the left side of the plot (beginning at 0:00 on Friday) and the most recent events are at the right side of the plot. The target window T contains the most recent four events. The reference window R contains earlier events, which are the four events immediately preceding the events in the target window in this example.

FIG. 3B shows an example of a target window and a reference window for an input data stream at a second point in time according to an embodiment of the present disclosure. The time is now Monday at 6:00, and a new event is received. Compared with FIG. 3A, the target window T moves forward in time (i.e., to the right) to contain the four most recent events. Similarly, the reference window R also moves to contain the four events immediately preceding window T.

FIG. 3C shows an example of a target window and a reference window for an input data stream at a third point in time according to an embodiment of the present disclosure. The time is now Monday at 8:00, and a new event is received. Compared with FIG. 3B, the target window T moves forward in time (i.e., to the right) to contain the four most recent events. Similarly, the reference window R also moves to contain the four events immediately preceding window T.

At each point in time, the monitoring value is determined by comparing the similarity between model scores for the events in the reference window R and model scores for the events in the target window T. For example, the Jensen-Shannon divergence is applied to events in windows R and T to determine the similarity. The monitoring value at Monday 0:00 (FIG. 3A) may be different from the monitoring value at Monday 6:00 (FIG. 3B), which in turn may be different from the monitoring value at Monday 8:00 (FIG. 3C).
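By way of non-limiting illustration, the sliding of fixed-size contiguous windows over a stream of model scores can be sketched in Python as follows. The names monitoring_values and mean_gap are hypothetical. The mean_gap metric is only a stand-in used to keep the sketch short; in practice a metric such as the Jensen-Shannon divergence between the two score histograms would be passed in its place.

```python
from collections import deque

def monitoring_values(scores, n_ref, n_tgt, metric):
    """Yield one monitoring value per new score once both windows are full."""
    window = deque(maxlen=n_ref + n_tgt)  # the oldest score drops out as a new one arrives
    for score in scores:
        window.append(score)
        if len(window) == n_ref + n_tgt:
            items = list(window)
            reference, target = items[:n_ref], items[n_ref:]  # R immediately precedes T
            yield metric(reference, target)

def mean_gap(reference, target):
    """Stand-in dissimilarity for illustration only (not the disclosed metric)."""
    return abs(sum(target) / len(target) - sum(reference) / len(reference))

scores = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
signal = list(monitoring_values(scores, n_ref=4, n_tgt=4, metric=mean_gap))
```

Each new score moves both windows forward by one event, so consecutive monitoring values share most of their inputs and the resulting signal changes gradually.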

In the examples in FIGS. 3A-3C, the reference window R and target window T are contiguous meaning that they are immediately next to each other. FIGS. 4A-4C show examples of different types of windows.

FIG. 4A shows an example of fixed-size contiguous windows according to an embodiment of the present disclosure. Here, window R and window T are fixed-size contiguous windows. The T window contains the most recent n_T events collected. The R window contains n_R events in a reference period immediately before T. In this example, the fixed size is four events so each of the windows R and T contains four events. Contiguous windows may be attractive for detecting changes in behavior occurring in relatively short time scales (e.g., a few hours to a few days). A contiguous R window is well suited for short time scales because it provides a comparison between the T window and the most recent events preceding it. In some embodiments, for long-lived alarms, the process freezes the time location of the R window temporarily and slides only the T window forward in time until the alarm is over to avoid an exit peak. A long-lived alarm is one that lasts longer than the combined size of the target and reference windows.

The window size can be selected in a variety of ways. The size of the T window can be defined in units of the average number of events in some period (e.g., one hour, half a day, or one day). In an embodiment, the default size of the R window is three times the average number of daily events and the size of the T window is 0.5 times the average number of daily events.
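By way of non-limiting illustration, the default sizes mentioned above can be derived from a history of event timestamps as sketched below in Python. The function name default_window_sizes and its parameters are hypothetical, and the sketch assumes that days with no events are simply left out of the daily average.

```python
from collections import Counter
from datetime import datetime

def default_window_sizes(timestamps, ref_factor=3.0, tgt_factor=0.5):
    """Return (n_ref, n_tgt) as multiples of the average number of daily events."""
    per_day = Counter(ts.date() for ts in timestamps)  # days with no events are not counted
    avg_daily = sum(per_day.values()) / len(per_day)
    n_ref = max(1, round(ref_factor * avg_daily))
    n_tgt = max(1, round(tgt_factor * avg_daily))
    return n_ref, n_tgt

# Hypothetical history: 10 events on one day and 30 on the next (average of 20 per day)
history = [datetime(2019, 5, 13, 8, m) for m in range(10)]
history += [datetime(2019, 5, 14, 8, m) for m in range(30)]
n_ref, n_tgt = default_window_sizes(history)
```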

Although in this example both T and R are the same size, they can be different sizes in other embodiments. For example, the R window size is chosen to be a multiple of the T window size (e.g., five times larger). The window can be sized based on the characteristics of the expected data. In various embodiments, the R window is at least as large as the T window in order to be more stable than the T window. The reference window defines the normal behavior so its histogram should not be noisier than the T histogram. The size of the R window and T window affects the amount of noise in the signal. Very short windows (e.g., 100 times smaller than the average number of daily transactions) tend to generate noisy signals, which result in more false alarms. On the other hand, very large windows (e.g., 30 times the average number of daily transactions) can make the signal insensitive to small changes in the distribution of model scores.

In various embodiments, fixed-size windows provide better control of estimators compared with other types of windows, since a fixed-size window fixes the dependency of the variance on the sample size and sample sizes are the same for all windows. In contrast, when comparing monitoring values for two different events using time-based windows, the comparison is made using monitoring values computed with two different sample sizes.

FIG. 4B shows an example of time-based contiguous windows according to an embodiment of the present disclosure. Here, window R and window T are fixed-time contiguous windows. The T window contains the events collected in the past 5 hours. The R window contains events in a reference period (5 hours) immediately before T. In this example, there is one event in T and two events in R.
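By way of non-limiting illustration, fixed-time contiguous windows can be selected as sketched below in Python. The function name time_windows and the representation of events as (timestamp, score) pairs are assumptions made for this sketch, as are the example dates.

```python
from datetime import datetime, timedelta

def time_windows(events, now, duration):
    """Split (timestamp, score) pairs into reference and target score lists.

    T covers (now - duration, now]; R covers the same duration immediately before T.
    """
    tgt_start = now - duration
    ref_start = tgt_start - duration
    target = [s for ts, s in events if tgt_start < ts <= now]
    reference = [s for ts, s in events if ref_start < ts <= tgt_start]
    return reference, target

# Hypothetical events mirroring the counts in FIG. 4B: two in R and one in T
events = [
    (datetime(2019, 5, 16, 20, 0), 0.3),  # too old for either window
    (datetime(2019, 5, 17, 2, 30), 0.1),
    (datetime(2019, 5, 17, 5, 0), 0.2),
    (datetime(2019, 5, 17, 9, 0), 0.9),
]
reference, target = time_windows(events, now=datetime(2019, 5, 17, 12, 0),
                                 duration=timedelta(hours=5))
```

Because the number of events in each window varies with traffic, monitoring values computed this way are based on different sample sizes at different times.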

FIG. 4C shows an example of homologous windows according to an embodiment of the present disclosure. Homologous windows can be used to calculate a monitoring value at 204 of FIG. 2. Homologous windows are regularly spaced windows with the same time duration as the corresponding target window. Thus, for a fixed-time target window, the corresponding homologous windows are also fixed-time. For a fixed-size target window (which will have a variable time duration), the corresponding homologous windows will have a matching (variable) time duration. Homologous windows may be used, for example, to cover the same period of the target window but on different previous days.

Homologous windows may be attractive for detecting changes in data with a strong seasonal behavior. An example of data that exhibits strong seasonality is certain types of events occurring more frequently at certain times of the day. For example, people tend to order coffee more frequently in the morning than the rest of the day. Thus, a coffee shop in a business district will see increased activity every weekday morning.

The R window is a set of replica windows occurring in the same period of the day as the T window but on previous days (homologous periods). In FIG. 4C this is depicted as reference windows R1-R4, which occur around 6:00 on Monday through Thursday. More specifically, T occurs between 4:00 and 8:00 on a Friday, so a homologous window configuration with four replicas contains events from the four previous days (Monday through Thursday) from 4:00 to 8:00. The size of the reference window is not fixed, but its time duration is fixed to be the same as the T window duration in this example.

When comparing events in reference windows R1-R4 and target window T, a histogram is made combining R1-R4, which is then compared with the histogram corresponding to target window T. In the coffee scenario, contiguous windows may induce repetitive (e.g., daily) alarms because customers do not order many coffees after midnight and order many coffees in the early morning. On the other hand, homologous windows correct for such seasonality by recognizing that the repetitive behavior of many coffee orders each day in the early morning is similar to each other. Whether to use contiguous or homologous windows is configurable. For example, a user can set a system to use contiguous windows when expecting a certain type of data or homologous windows when expecting a different type of data.
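By way of non-limiting illustration, pooling the events of the homologous replica windows can be sketched in Python as follows. The function name homologous_reference, the (timestamp, score) event representation, and the example dates are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

def homologous_reference(events, tgt_start, tgt_end, n_replicas=4):
    """Pool scores from the target's time-of-day span on each of the previous n_replicas days."""
    pooled = []
    for k in range(1, n_replicas + 1):
        shift = timedelta(days=k)
        lo, hi = tgt_start - shift, tgt_end - shift  # same clock interval, k days earlier
        pooled.extend(s for ts, s in events if lo <= ts < hi)
    return pooled

# Hypothetical stream: one morning event on each of four days, plus two other events
events = [(datetime(2019, 5, d, 6, 0), d / 100) for d in (13, 14, 15, 16)]
events.append((datetime(2019, 5, 16, 12, 0), 0.99))  # outside the 4:00 to 8:00 span
events.append((datetime(2019, 5, 17, 6, 0), 0.50))   # falls in the target window itself
reference = homologous_reference(events,
                                 tgt_start=datetime(2019, 5, 17, 4, 0),
                                 tgt_end=datetime(2019, 5, 17, 8, 0))
```

The pooled scores can then be binned into a single reference histogram and compared with the histogram of the target window.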

FIG. 4D shows an example of homologous windows according to an embodiment of the present disclosure. Unlike FIG. 4C in which the target window T is defined based on time (5 hours), the target window here is fixed size, namely four events. As shown, the target window T includes four events, which correspond to approximately 12 hours (20:00 to 8:00) so the reference windows on the previous days (Friday through Tuesday) are also from 20:00 to 8:00.

The monitoring values obtained using the windows comparison are then compared with a threshold to determine changes in behavior. The threshold can be determined as follows.

Adaptive threshold estimation for streaming data is disclosed. An adaptive streaming percentiles estimator estimates percentiles for streaming data by using a fixed number of bins that are updated in a single linear pass. If a new monitoring value stands out compared with a distribution of previous monitoring values, then an alarm can be raised to further study the occurrence/anomaly or take remedial action. A threshold based on the estimated percentile can be used for automatic model monitoring. For example, the threshold is used as the threshold at 208 of FIG. 2 such that a monitoring value meeting or exceeding the threshold causes an indication (e.g., alarm) to be generated. The adaptive streaming percentiles estimator can be used for any streaming data including but not limited to fraud detection and analyzing user profiles.

The threshold can be calculated using a fixed percentile or a Tukey fence. A fixed percentile defines outlier values for the signal by flagging all values that fall in the upper tail of the distribution computed with the whole series (e.g., above the 95th percentile).
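By way of non-limiting illustration, a fixed-percentile threshold computed over a whole series can be sketched in Python as follows. The function names are hypothetical, and the sketch uses the standard library's batch quantile routine rather than the streaming estimator described in this disclosure.

```python
from statistics import quantiles

def fixed_percentile_threshold(signal, percentile=95):
    """Return the given (integer, 1-99) percentile of the whole series."""
    cuts = quantiles(signal, n=100, method="inclusive")  # 99 cut points
    return cuts[percentile - 1]

def outliers(signal, percentile=95):
    """Indices of values in the upper tail above the percentile threshold."""
    threshold = fixed_percentile_threshold(signal, percentile)
    return [i for i, value in enumerate(signal) if value > threshold]

series = list(range(101))  # hypothetical monitoring values 0, 1, ..., 100
flagged = outliers(series)
```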

A Tukey fence is an alternative definition of outlier that focuses on the width of the central part of the distribution. For example, the outlier can be given by an upper Tukey fence:

Q3+k(Q3−Q1)   (1)

where Q1 is the first quartile and Q3 is the third quartile. k>0 is a tunable parameter that controls how much the threshold is above Q3. For example, for a Gaussian distribution, k=1 corresponds to approximately percentile 97.8 and k=1.5 corresponds to approximately percentile 99.7. The upper Tukey fence may be attractive for streaming data because it focuses on the central part of the distribution. In a streaming data environment, any two consecutive values of the signal time series are highly correlated. This is because there is only one new instance entering the T window when a new event arrives (as described with respect to FIGS. 3A-3C). Thus, changes accumulate slowly and the signal varies almost continuously. This means that when a peak occurs in the signal, not only the value of the signal at the peak but also the neighboring points (which tend to be large as well) contribute to the tail of the distribution. Hence, the upper Tukey fence is an attractive choice because it is less sensitive to the tail.
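By way of non-limiting illustration, equation (1) can be evaluated in Python as sketched below. The function names upper_tukey_fence and gaussian_fence_percentile are hypothetical. The first function applies equation (1) to a batch of values using the standard library's quartile routine; the second reports which percentile of a standard Gaussian distribution the fence corresponds to for a given k.

```python
from statistics import NormalDist, quantiles

def upper_tukey_fence(values, k=1.5):
    """Equation (1): Q3 + k(Q3 - Q1), with quartiles estimated from a batch of values."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 + k * (q3 - q1)

def gaussian_fence_percentile(k):
    """Percentile of a standard Gaussian at which the upper Tukey fence falls."""
    nd = NormalDist()
    q1, q3 = nd.inv_cdf(0.25), nd.inv_cdf(0.75)
    return 100 * nd.cdf(q3 + k * (q3 - q1))

fence = upper_tukey_fence(list(range(1, 12)), k=1.5)  # hypothetical values 1..11
pct_k1 = gaussian_fence_percentile(1.0)
pct_k15 = gaussian_fence_percentile(1.5)
```

Because the fence depends only on the quartiles, a few very large values in the tail leave it essentially unchanged.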

Both methods (fixed percentile or Tukey fence) rely on the estimation of percentiles. The percentile estimation techniques described below can be applied to both methods as well as other outlier definitions. The techniques are reliable and flexible and can be used to calculate a threshold in either of the cases (fixed percentiles or Tukey fence) described above. In various embodiments, a fixed number of bins are updated all at once, with a single linear pass, which can then be used to estimate any percentile through interpolation. This approach is a stochastic approximation of the cumulative distribution function. When each new event is received, the percentiles are updated to restore an invariant such that the average count per bin is the same for all bins.

FIG. 5 is a flow chart illustrating an embodiment of a process for adaptive threshold estimation for streaming data. The process can be performed by a device such as node 1242.1 or 1242.2 of cluster 1240 (alone or in cooperation) or by a processor such as the one shown in FIG. 13.

The process begins by determining initial positions for a set of percentile bins (500). The initialization is performed as follows. The initial positions are determined using the first values that stream into the system. The number of percentile bins (n) can be pre-defined. For the first n+1 events that stream in, the event values are inserted into a global list P in sorted order. This initializes an estimate of the n+1 percentile positions. In various embodiments, the first n+1 events are unique. If they are not unique, then the initialization step includes injecting numerical noise into the event values, so that all initial percentile position values are unique.

The process receives a new data item in a stream of data (502). The process consumes the data items (also called “records” or “events”) as they stream into the system. The percentile position estimates are updated as events stream in. For each incoming event the percentile position estimates in global list P are updated taking into account the incoming event value and the current total count C. Redistributing positions updates the percentiles in each bin while maintaining the invariant that the estimated number of counts in each bin is the same for all bins as follows.

The process identifies one of the set of percentile bins corresponding to the new data item (504). The incoming data record can be classified into one of the bins. The process finds the appropriate bin and accounts for the incoming event as follows.

The process increments a count of items in the identified percentile bin (506). This accounts for classifying the incoming data record as belonging to the identified percentile bin. Increasing the count breaks the invariant, so the process will proceed to update percentiles as follows.

The process adjusts one or more counts of data items in one or more of the percentile bins including by applying a suppression factor based on a relative ordering of items (508). The suppression factor can be thought of as a forgetting factor (e.g., assigning a lower weight to older events) that makes an estimation of percentiles adaptive. This may be better for streaming data where the local distribution of monitoring values varies considerably over time, because an adaptive estimate tracks the recent distribution more accurately. The suppression factor is predetermined (e.g., selected by a user) and can be applied as further described with respect to FIG. 8.

The process redistributes positions for the set of percentile bins to equalize respective count numbers of items for each percentile bin of the set of percentile bins (510). Redistributing positions of the bins restores the invariant after it was broken in 506. The process calculates a new target count for each bin and adjusts the size of each of the bins based on whether the count of a bin is less than or greater than the new target count. If the count of the bin is equal to the new target count then no adjustment is made to the bin's size. An example of a process for redistributing positions is shown in FIG. 6.

The process utilizes the redistributed positions of the set of percentile bins to determine a percentile distribution of the stream of data (512). The set of percentile bins that results from 510 gives a percentile distribution of the stream of data. The height of each bin is the same (the invariant). This provides a good resolution so that regions of low density and high density are covered in the same way. The percentile distribution gives an indication of whether a current event is anomalous. If the event is uncommon (goes above percentile 75 for example), then this may indicate a change in behavior such as fraud.

The process calculates a threshold based at least in part on the percentile distribution (514). In various embodiments, the threshold is obtained by applying an outlier definition. By way of non-limiting example, the outlier definition can be a fixed percentile or a Tukey fence.

In various embodiments, the threshold is obtained by further processing the outlier definition using delayed exponential weighting on previous estimates to obtain a final threshold. Applying delayed exponential weighting may be attractive because a local distribution of monitoring values can vary considerably with time if the data is non-stationary. Therefore, defining a threshold based on all past monitoring values may provide an inaccurate estimate of the local distribution of monitoring values (for example in the last month). The threshold can account for this by being adaptive and giving greater weight to more recent transactions as further described below.

The disclosed adaptive threshold estimation techniques have many advantages over existing methods by being more space-efficient, time-efficient, and reducing processing cycles needed to process streaming data. In one aspect, the process stores only a fixed size O(n) object with the positions of n+1 percentile estimates P≡[P0, P1, . . . , Pn], where P0 and Pn provide estimates of the lower/upper range of the domain of the distribution, respectively. In another aspect, the time complexity for each incoming event is O(n), so that on any new event all percentiles are updated in a single pass over the percentiles object. This means that in a streaming implementation each event is processed only once and the new estimate P only depends on the last estimate. Conventional methods tend to be more resource-intensive because they sample previously observed instances and keep them in memory, which requires managing clusters of samples including sorting operations.

The process shown in FIG. 5 can be repeated for each new additional data record received until an entire data stream is processed.

FIG. 6 is a flow chart illustrating an embodiment of a process for redistributing positions for a set of percentile bins. The process can be performed as part of another process such as 510 of FIG. 5.

The process calculates the new target count for each bin (602). In various embodiments, the new target count is the mean number of events per bin after adding the new event. Then, the process loops over all bins from left to right. For each bin, the process determines whether the bin's count is less than the new target count (604).

If the bin's count is less than the new target count, the process moves a wall of the bin in a first direction (606). In various embodiments, the process moves the right wall of the bin to the right (the first direction). This “eats into” a portion of the next bin (to the right of the current bin) based on its density.

If the bin's count is greater than the new target count, the process moves a wall of the bin in a second direction (608). The bin's count is greater than the new target count after encountering the bin into which the current event is sorted. In various embodiments, the process moves the right wall of the bin to the left (the second direction). This “sheds away” a portion to the next bin (to the right of the current bin) based on the current bin density.

Moving the walls of the bins (606 and 608) redistributes the positions of the bins so that the end result after all of the bins have been processed is that an invariant, namely the new target count, is maintained. The next figure shows an example of redistributing the positions by moving bin walls.
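
The update loop of FIGS. 5 and 6 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive implementation. The class name `StreamingPercentiles`, the `decay` parameter (standing in for the suppression factor), the size of the jitter used for duplicate initial values, and the uniform-density assumption inside each bin are all hypothetical choices. The shedding step moves the right wall of a bin to the left, as in the FIG. 7 walkthrough. The sketch also assumes the total count stays above the number of bins, which holds when `1 / (1 - decay)` comfortably exceeds `n_bins`.

```python
import bisect
import random


class StreamingPercentiles:
    """Illustrative fixed-bin streaming percentile estimator (hypothetical names)."""

    def __init__(self, n_bins, decay=1.0):
        self.n = n_bins      # number of percentile bins
        self.decay = decay   # suppression factor applied to old counts (1.0 = none)
        self.walls = []      # P: n + 1 sorted percentile positions
        self.total = 0.0     # C: total count; under the invariant each bin holds total / n

    def update(self, x):
        n, walls = self.n, self.walls
        if len(walls) < n + 1:
            # 500: seed the walls with the first n + 1 values, jittering duplicates.
            while x in walls:
                x += random.uniform(-1e-9, 1e-9) * max(1.0, abs(x))
            bisect.insort(walls, x)
            self.total += 1.0
            return
        self.total *= self.decay  # forget a little of the past before counting x
        # 504: find the bin for x, stretching an outer wall if x is out of range.
        if x < walls[0]:
            walls[0], k = x, 0
        elif x >= walls[-1]:
            walls[-1], k = x, n - 1
        else:
            k = bisect.bisect_right(walls, x) - 1
        counts = [self.total / n] * n
        counts[k] += 1.0          # 506: this breaks the equal-count invariant
        self.total += 1.0
        target = self.total / n   # 602: new mean count per bin
        for i in range(n - 1):    # single left-to-right pass over the bins
            diff = target - counts[i]
            if diff > 0:
                # 606: move the right wall right, eating into the next bin.
                next_width = walls[i + 2] - walls[i + 1]
                walls[i + 1] += next_width * diff / counts[i + 1]
            elif diff < 0:
                # 608: move the right wall left, shedding into the next bin.
                width = walls[i + 1] - walls[i]
                walls[i + 1] -= width * (-diff) / counts[i]
            counts[i + 1] -= diff  # the neighbor loses or gains the same amount

    def percentile(self, q):
        """Interpolate the q-th percentile (0 <= q <= 100) between bin walls."""
        pos = q / 100.0 * self.n
        i = min(int(pos), self.n - 1)
        return self.walls[i] + (pos - i) * (self.walls[i + 1] - self.walls[i])
```

Only the n + 1 wall positions and one total count are stored, and each event triggers one pass over the walls, matching the O(n) space and time behavior described above.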

FIG. 7 shows an example of bins that are processed using an adaptive streaming percentiles estimator according to an embodiment of the present disclosure. Histogram 702 contains 10 bins where each bin is a percentile bin meaning that its wall (or boundary) represents an estimated percentile position of the events in the bin. The height of the bin represents how many events fall into that bin. Lower density bins are wider and higher density bins are narrower.

The height of the bins is an invariant that is maintained so that the heights of the bins are the same and the widths vary depending on how much the events are distributed. In various embodiments, the height is maintained as an invariant so that by the end of the redistribution process shown here the heights of all of the bins are the same (712). At intermediate steps (e.g., 704-710) the heights are not necessarily the same and the wall of the bin is moved to maintain the correct count for each bin. By the end of the redistribution process, the invariant (height) is restored for all bins.

When a new event is received, the event is placed (accounted for) in a bin and the bins are redistributed to maintain the same height for all bins while the widths are adjusted. In this example, the new event falls into Bin 7 so the count of Bin 7 increments as represented by its taller height compared with the other bins. That is, state 702 of the histogram is the result after performing 506 of FIG. 5. States 704-712 of the histogram show what happens when walls of the percentile bins are redistributed (moved) to equalize respective count numbers for each percentile bin. Moving bin walls corresponds to 510 of FIG. 5 and FIG. 6. Equalizing respective count numbers means restoring/maintaining an invariant across all bins.

The new target count (corresponding to 602 of FIG. 6) is represented by the dashed line. The process of FIG. 6 loops through all of the bins, and state 704 shows what happens when bins are redistributed by passing through the bins from left to right. Each of the bins will be updated by moving a wall of the bin to restore the invariant so that all of the bins are the same height.

Bin 1 (highlighted) is adjusted because the bin's count (height) is less (lower) than the new target count. The new target count can be a whole count or a fraction of a count. The bin is adjusted by making it taller (to reach the height of the new target count) and moving the right wall of the bin to the right. This corresponds to 606 of FIG. 6. Returning to FIG. 7, after 704, Bins 2-6 are each processed in the same way by moving their right walls to the right because the count of each of the bins is less than the new target count. State 706 shows the bins after Bins 1-6 have been processed.

Referring to state 706, the count of Bin 7 is greater than the new target count (taller than the dashed line). Since Bin 7's count is not less than the new target count, the right wall of Bin 7 is moved to the left and its height is lowered to meet the new target count. This corresponds to 608 of FIG. 6. Moving the right wall of Bin 7 to the left causes the height of the right adjacent bin (i.e., Bin 8) to increase as shown at 708. After adjusting the count of Bin 8, the count of Bin 8 exceeds the dashed line representing the target count.

Returning to FIG. 7, Bins 8-10 are each processed in the same way as Bin 7 by moving their right walls to the left. Referring to state 708, the count of Bin 8 exceeds the new target count, so its right wall is moved to the left. Consequently the count of Bin 9 is increased as shown. Next, at state 710, the right wall of Bin 9 is moved to the left because the count of Bin 9 exceeds the new target count. Consequently, the count of Bin 10 is increased as shown in 712. Because of the way the new target count was calculated, the resulting state of Bin 10 (and Bins 1-9) is such that the invariant is restored. State 712 shows the bins after Bins 8-10 have been processed. Bin 7 and the bins to the right (i.e., Bins 8-10) are shaded in a different pattern from the other bins to more clearly distinguish the two groups of bins from each other.

In some embodiments, the new event (which was placed in Bin 7 here) is smaller than the smallest value in the histogram. In this situation, the event is placed in Bin 1 and the left wall of Bin 1 is moved to the left to account for the event being smaller than the smallest value previously seen and Bin 1's count increases accordingly. Similarly, if the new event is larger than the largest value in the histogram, the event is placed in Bin 10 and the right wall of Bin 10 is moved to the right to account for the event being larger than the largest value previously seen and Bin 10's count increases accordingly.

In various embodiments, redistributing positions creates a directional bias in the estimate because the percentiles are updated from left to right. One way to correct this bias is to apply the update from right to left (in addition to left to right described above) and average the two results (i.e., the left to right pass and the right to left pass).

Another way to correct the bias, which avoids doubling the amount of work, is to choose between a left-to-right pass and a right-to-left pass on each new incoming event, either by alternating or with equal probability (to avoid reintroducing bias if the stream contains unfavorable correlations).

Next, updating the percentile distribution of the stream of data including by applying a suppression factor for each iteration to assign a lower weight to older events will be described (e.g., 512 of FIG. 5).

There are a variety of suppression factors (and ways to apply them) and the following example is merely illustrative and not intended to be limiting. One way of applying the suppression factor is to suppress the total count, which suppresses the histogram on any incoming event. For example, prior to adding a new event value to a bin (506), all bins are suppressed (e.g., multiply all values by 0.99). This gives higher weight to counts in bins that have recently received an instance, and suppresses the counts of bins that have not received instances recently. Here the suppression is applied at the level of the counts on the histogram to “forget” previous events directly. This is also memory lighter, because the total histogram count is saved without needing to save other values, whereas additional smoothing (as proposed by conventional techniques) requires saving all the smoothed out percentiles as well.

The suppression can be time-based or index-based. For example, index-based suppression uses a constant (index-based) decay rate $0<\gamma<1$, where $n_{1/2}^{\gamma} \equiv -1/\log_2 \gamma$ is the number of events to be processed to achieve a suppression factor of ½ (i.e., the solution of $\gamma^{n}=1/2$). In one framework, this would be several times the total number of events in the T plus R windows so that a higher importance is given to more recent monitoring values.
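
The relation between an index-based decay rate and its half-decay length follows from solving γ^n = ½ for n. The short Python sketch below converts in both directions; the function names are hypothetical and chosen only for illustration.

```python
import math


def half_life_events(gamma):
    """Number of events n after which gamma**n equals 1/2 (requires 0 < gamma < 1)."""
    return -1.0 / math.log2(gamma)


def decay_for_half_life(n_half):
    """Decay rate gamma whose suppression reaches 1/2 after n_half events."""
    return 0.5 ** (1.0 / n_half)
```

For instance, choosing a half-decay length equal to several times the combined size of the T and R windows yields a decay rate very close to 1.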

One advantage of an adaptive threshold based on Tukey Fences (with a forgetting factor) is that it gives greater weight to more recent monitoring values, so it adapts to changes in the distribution of monitoring values. However, this also means that when the signal starts increasing near an alarm, the threshold also tends to increase. To address this issue, a delay can be applied so that the threshold is more sensitive to monitoring values before the target window. A side effect of this approach is that the threshold increases, with a delay, after the peak in the signal. This prevents immediate alarms due to large signal fluctuations while the windows are passing through the alarm region. This may be desirable if one wants to prevent immediate alarms while the R and T windows have time to refill with new events. In an alternative embodiment, the adaptive streaming percentiles estimator is paused to prevent processing of monitoring values while the signal is larger than the threshold.

In various embodiments, a delay is applied through a delayed exponential moving (EM) average. This is attractive because a constant size state, to be updated on each new event, is saved without needing to store anything else. If the threshold values are $\tau_i$, with i=0, 1, . . . , j where j is the index of the latest event, then the EM sum is defined as:

$S_j^{\alpha} \equiv \sum_{i=0}^{j} \alpha^{j-i}\,\tau_i = \tau_j + \alpha S_{j-1}^{\alpha} \qquad (2)$

where $S_j^{\alpha}$ is the EM smoothed out threshold sum, and $0<\alpha<1$ is the EM decay rate parameter. The EM count is defined similarly as $N_j^{\alpha}=1+\alpha N_{j-1}^{\alpha}$. The delayed EM sum (or count) can be obtained by subtracting a second EM sum (or count) with a stronger decay rate $\beta$ (i.e., $0<\beta<\alpha$):

$S_j^{\alpha\beta} \equiv \sum_{i=0}^{j} \left(\alpha^{j-i}-\beta^{j-i}\right)\tau_i = \alpha S_{j-1}^{\alpha\beta} + (\alpha-\beta)\,S_{j-1}^{\beta} \qquad (3)$

The delayed EM average for the threshold is defined by dividing the delayed sum and delayed count to obtain a threshold:

$\tau_j^{D} = \frac{\sum_{i=0}^{j}\left(\alpha^{j-i}-\beta^{j-i}\right)\tau_i}{\sum_{i=0}^{j}\left(\alpha^{j-i}-\beta^{j-i}\right)} = \frac{S_j^{\alpha\beta}}{N_j^{\alpha\beta}} \qquad (4)$

This threshold is adaptive because it forgets older values of the signal. The decay rate parameter is related to the half-decay length $n_{1/2}^{\alpha} = -1/\log_2 \alpha$ (similarly to $n_{1/2}^{\gamma}$). Similar definitions can be made for time-based weights by replacing the indices i, j by time coordinates.
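
The recursions in equations (2) through (4) can be maintained with a constant-size state. The following Python sketch is one possible arrangement, assuming 0 < β < α < 1; the class name `DelayedEMAverage` and its attribute names are hypothetical. Because the most recent event always has weight α⁰ − β⁰ = 0, the delayed average is undefined for the very first event, and the sketch returns `None` in that case.

```python
class DelayedEMAverage:
    """Illustrative delayed exponential moving average of threshold values."""

    def __init__(self, alpha, beta):
        if not 0.0 < beta < alpha < 1.0:
            raise ValueError("expected 0 < beta < alpha < 1")
        self.alpha, self.beta = alpha, beta
        self.s_beta = self.n_beta = 0.0  # plain EM sum and count with rate beta
        self.s_ab = self.n_ab = 0.0      # delayed EM sum and count

    def update(self, tau):
        a, b = self.alpha, self.beta
        # Equation (3) uses the previous S^beta, so update the delayed state first.
        self.s_ab = a * self.s_ab + (a - b) * self.s_beta
        self.n_ab = a * self.n_ab + (a - b) * self.n_beta
        # Equation (2) and the matching count recursion, with rate beta.
        self.s_beta = tau + b * self.s_beta
        self.n_beta = 1.0 + b * self.n_beta
        # Equation (4): delayed sum divided by delayed count.
        return self.s_ab / self.n_ab if self.n_ab > 0.0 else None
```

Each call to `update` touches four numbers, so no history of past thresholds needs to be stored.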

FIG. 8 shows an example of the effects of various exponential moving (EM) average weights. In various embodiments, smoothing can be applied. For example, exponential moving average smoothing is applied on the calculated threshold to obtain a new threshold as described above. There are a variety of ways to apply a suppression factor to assign a lower weight to older events. The suppression factor can be time-based, if it is proportional to the time lag since the previous event, or index-based, if it is constant. The suppression factor can also include a delay, as described above for the adaptive threshold, or it can be a suppression without delay. An example of a suppression without delay is when a count of all bins is decreased prior to incrementing a count of items in the identified percentile bin as described above.

The circles running across the top of the plot represent the unweighted events (here they are all weight 1). The exponentially weighted events shown in the plot represent the same events after the delayed EM weights are applied (dark shaded area). For comparison, the two non-delayed weights are (α^(j−i)) and (β^(j−i)) as shown in FIG. 8. In various embodiments, delayed exponential moving average smoothing is applied to suppress effects of recent events. An example of this in FIG. 8 is the curve associated with α^(j−i)−β^(j−i), which gives lower weight to more recent events on the right side of the plot.

In various embodiments, when the monitoring value is larger than threshold τ_(j) ^(D), an alarm is triggered. However, that is not necessarily the peak of the signal, where the anomalous behavior may be clearer. As described above, in various embodiments, a first alarm is triggered and accompanied by a flag indicating that the signal is still increasing. Later, an updated alarm at the peak (or in periodic intervals until the peak is attained) is triggered.

The adaptive threshold can be used to determine that a monitoring value meets or exceeds the threshold, in which case an explanation report is generated as follows.

Explanation reporting based on differentiation between items in different data groups is disclosed. A report includes a summary of events and features that explain changes in behavior (e.g., concept drift). The report can be generated based on the automatic model monitoring and adaptive threshold estimation techniques disclosed herein.

FIG. 9 is a flow chart illustrating an embodiment of a process for explanation reporting based on differentiation between items in different data groups. The process can be performed by a device such as node 1242.1 or 1242.2 of cluster 1240 (alone or in cooperation) or by a processor such as the one shown in FIG. 13.

The explanation reporting is an example of an indication associated with detecting that monitoring values meet a threshold (210 of FIG. 2) or can be performed in response to determining that one or more monitoring values meet a threshold. The explanation report provides information about the characteristics of the subset of events in the target T window that caused the alarm. In various embodiments, the explanation report can trigger automatic remedial measures or can be helpful for a user to analyze the alarm and take further action.

The process obtains model scores for an input dataset from a first machine learning model (900). The first machine learning model can be trained to take data as input and output a model score for each data record in at least a portion of an input dataset. An example is 202 of FIG. 2.

The process trains a second machine learning model to learn how to differentiate between two groups (902). The second machine learning model is a classification model that differentiates between two groups based on the features and/or model score present in each of the data records. The set of features can contain a subset containing raw fields of the data record and/or transformations of the raw fields. The model scores can be generated by the first machine learning model by processing events in a target T window and a reference R window using a measure of similarity/dissimilarity. Examples of target and reference windows are described above. The process ranks the T window events according to how likely they are to explain the alarm. In various embodiments, the model score, used in the computation to produce the monitoring value as described in FIG. 2, provides on its own an aggregated view of each event and is used to rank the T window events (without also needing to use features). Other features of the events may provide further useful information. In various embodiments, the process uses a machine learning model that considers both features and model scores.

For each alarm, the process creates a new target binary label with value 1 for events in T (the first group) and value 0 for events in R (the second group) and trains the second machine learning model to learn how to separate events in the two windows. An example of the second machine learning model is a Gradient Boosted Decision Trees (GBDT) model. The GBDT model allows the process to obtain an alarm score that can be used to rank events in T (e.g., a higher score is closer to the top). In addition, the GBDT model may be attractive because it directly provides a measure of feature importance that handles correlated features well. The latter provides a way of ranking the features themselves. In various embodiments, the number of trees of the GBDT model is fixed to 50, and the maximum depth of the trees is fixed to 5.

The process applies the second machine learning model to each data record in the data records in the first group to determine a corresponding ranking score for each data record in the data records in the first group (904). The ranking pushes to the top the events that are responsible for distorting the distribution of model scores in the target window. In various embodiments, removing events from the top of the list will suppress the signal to restore the signal to be below the threshold.

The process determines a relative contribution of each of the data records in the first group to the differentiation between the first group of data records and the second group of data records based on the corresponding ranking scores (906). The relative contribution is an explanation of a cause of the alarm. For example, an account, card, user, etc. associated with the data record may be malicious.
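
Steps 902 through 906 can be sketched in Python as shown below. This is a minimal illustration that assumes NumPy and scikit-learn are available and uses scikit-learn's `GradientBoostingClassifier` as one possible GBDT implementation; the source does not name a specific library. The function name `rank_target_events` and the array layout (one row per event, one column per feature or model score) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def rank_target_events(X_ref, X_tgt, n_trees=50, max_depth=5, seed=0):
    """Train a T-versus-R classifier and rank the T events by alarm score."""
    # Label 1 for target (T) events and 0 for reference (R) events.
    X = np.vstack([X_tgt, X_ref])
    y = np.concatenate([np.ones(len(X_tgt)), np.zeros(len(X_ref))])
    model = GradientBoostingClassifier(
        n_estimators=n_trees, max_depth=max_depth, random_state=seed
    )
    model.fit(X, y)
    # Alarm score: predicted probability that an event belongs to T.
    alarm_scores = model.predict_proba(X_tgt)[:, 1]
    order = np.argsort(-alarm_scores)  # indices of T events, highest score first
    return order, alarm_scores, model.feature_importances_
```

The returned `order` gives the ranking of events in T, and `feature_importances_` gives a ranking of the features themselves.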

In various embodiments, pre-processing is performed prior to training the machine learning model (902). The pre-processing addresses the potential issue that, in a machine learning model approach, some features may be correlated with time or (similarly) with the index that defines the order of the events. Due to the sequential nature of the window configuration (T comes after R), those features will allow the model to very easily learn how to separate the T window events from the R window events using that time information instead of learning the differences in the distributions of features between the two windows. To prevent this, a pre-processing process is applied in a burn-in period to detect features that correlate with time. Those features are then excluded from the training of the machine learning model. An example of a pre-processing process is shown in FIG. 10.

FIG. 10 is a flow chart illustrating an embodiment of a process for removing time correlated features in a data set. The process can be performed as part of another process, for example prior to 902 of FIG. 9. The process can be performed during a burn-in period to detect time-correlated features. The burn-in period is a set of initial events in the data stream used for initialization. For example, during the burn-in period, windows are filled up so that the monitoring values and time-correlated features can be determined. Removing time- or index-correlated features results in a better input to the machine learning model to yield better explanations.

The process begins by obtaining a data series for a feature X associated with a distribution of values that generated the data records (1000). For example, consider a time series:

[(t₀, X₀), . . . , (t_(i), X_(i)), . . . , (t_(N), X_(N))]  (5)

For streams of data with sizes above the thousands of instances, the time series for the feature values X_(i) in the data records provides a good estimate of the distribution of values associated with the process responsible for generating the data.

The process shuffles the data series randomly a predetermined number of times (1002). The process generates values by shuffling the series randomly M times. The number of times to shuffle the series can be selected to ensure a high statistical confidence that a feature has a high correlation and should be excluded. For example, the process generates around 60 values as further explained below.

The process calculates the corresponding values of a measure of correlation for each shuffle (1004). Whether there is a correlation between an ordered set of timestamps (or index values) T=[t₀, . . . , t_(i), . . . , t_(N)] and the feature values X=[X₀, . . . , X_(i), . . . , X_(N)] can be determined by using a measure of correlation that is sensitive to non-linear relations. One such measure of correlation is a Maximal Information Coefficient (MIC), which is bounded in the interval [0, 1] where MIC=1 corresponds to a perfect correlation.

Under the null hypothesis H₀ that the feature X is not time correlated, the number M of shuffled MIC samples that need to be observed so that at least one of the MIC values is as large as MIC_(α) (the value delimiting the upper α tail of the distribution) with probability at least p is given by:

$P\left(\max(MIC_1, \ldots, MIC_M) \geq MIC_{\alpha}\right) = 1-(1-\alpha)^{M} \geq p \qquad (6)$

where

$M \geq \frac{\log(1-p)}{\log(1-\alpha)}.$

For simplicity, set p=1−α. If α=0.05, then M on the order of 60 gives a 95% probability to obtain one MIC value (or more) in the 5% upper tail of the distribution.

The process selects a maximum observed value among the shuffles to be a threshold (1006). The maximum observed value in the M shufflings serves as a threshold for the feature X, given X and T and MIC(X, T)≠0. As further described below, the threshold will be used to determine whether to remove features.

The process determines a value for the measure of correlation without shuffling (1008). Continuing with the example of Maximal Information Coefficient (MIC), the process determines the MIC value of the data series of a feature X=[X₀, . . . , X_(i), . . . , X_(N)] without shuffling the data series.

The process removes a feature if the value for the measure of correlation without shuffling of the feature is larger than the threshold (1010). In other words, the process compares the value obtained at 1008 with the threshold obtained at 1006. A feature is removed if MIC(X) is larger than the determined threshold.
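
The shuffle test of FIG. 10 can be sketched in Python as follows. Computing MIC requires a dedicated library, so this sketch substitutes the absolute Spearman rank correlation as a stand-in measure; unlike MIC, it only detects monotonic relations. The `measure` argument is left pluggable so that an MIC implementation could be supplied instead. All function names here are hypothetical.

```python
import math
import random


def required_shuffles(alpha=0.05, p=0.95):
    """Smallest integer M satisfying 1 - (1 - alpha)**M >= p."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - alpha))


def _ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def abs_rank_correlation(t, x):
    """Stand-in for MIC: |Spearman correlation|, ties broken by position."""
    rt, rx = _ranks(t), _ranks(x)
    mean = (len(rt) - 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rt, rx))
    var = sum((a - mean) ** 2 for a in rt)  # identical for both rank vectors
    return abs(cov / var)


def is_time_correlated(t, x, measure=abs_rank_correlation,
                       alpha=0.05, p=0.95, seed=0):
    """Return True if feature x should be removed as time correlated."""
    rng = random.Random(seed)
    shuffled = list(x)
    null_values = []
    for _ in range(required_shuffles(alpha, p)):  # 1002: shuffle M times
        rng.shuffle(shuffled)
        null_values.append(measure(t, shuffled))  # 1004: measure per shuffle
    threshold = max(null_values)                  # 1006: maximum observed value
    return measure(t, x) > threshold              # 1008 and 1010
```

With α = 0.05 and p = 0.95, `required_shuffles` returns 59, consistent with M on the order of 60.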

FIG. 11 shows an example of an explanation report according to an embodiment of the present disclosure. The explanation report (called an alarm report here) is generated using the process of FIG. 9.

In various embodiments, the explanation report includes one or more of the following sections:

- Windows information with start and end timestamps for each window (1102),
- A feature importance ranking list (which can be truncated, e.g., top 10) (1104),
- Validation curve to observe how well the ranking can lower the signal (1106),
- A table of the top N (e.g., 100) events that explain the alarm. The table contains the feature values used by the machine learning model (with columns ordered from left to right according to the feature importance ranking). This may contain some extra fields selected according to domain knowledge (e.g., emails, addresses, etc.) (1108).

The validation graph (1106) shows the robustness of the ranking provided by the machine learning model and can be generated as follows. Since the goal of the ranking is to push to the top the events that are responsible for distorting the distribution of model scores in the target window, removing events from the top of the list is expected to suppress the signal. Therefore, in the validation curve each point is the value of the signal using R as reference, but T with the top k events removed. For comparison, a curve is defined where, for each point, k events are randomly removed from T. The drift score curve is not expected to lower the monitoring value if the alarm is a false positive. In that case the drift score curve (removal by drift score) should be similar to or above the random curve.
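
The two curves can be sketched in Python as below. The signal used here, the absolute difference between the mean model scores of T and R, is only a stand-in chosen to keep the example self-contained; the actual monitoring value would be supplied through the `signal` argument. The function names, and the assumption that every k is smaller than the number of events in T, are hypothetical.

```python
import random
import statistics


def mean_shift_signal(ref_scores, tgt_scores):
    """Stand-in monitoring value: absolute difference of mean model scores."""
    return abs(statistics.fmean(tgt_scores) - statistics.fmean(ref_scores))


def validation_curves(ref_scores, tgt_scores, ranked_idx, ks,
                      signal=mean_shift_signal, seed=0):
    """Signal after removing the top-k ranked events versus k random events."""
    rng = random.Random(seed)
    by_rank, by_chance = [], []
    for k in ks:
        top = set(ranked_idx[:k])
        rand = set(rng.sample(range(len(tgt_scores)), k))
        kept_rank = [s for i, s in enumerate(tgt_scores) if i not in top]
        kept_rand = [s for i, s in enumerate(tgt_scores) if i not in rand]
        by_rank.append(signal(ref_scores, kept_rank))
        by_chance.append(signal(ref_scores, kept_rand))
    return by_rank, by_chance
```

If the ranking is informative, the first curve drops faster than the random-removal curve as k grows.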

Automatic model monitoring systems implemented using the techniques disclosed have yielded experimental results where new anomalies were detected compared to a conventional system with only a supervised machine learning model scoring component. Aggregating events and processing them using the disclosed techniques allows more anomalies to be detected, including those that conventional systems are unable to detect. In one instance, an automatic model monitoring system was evaluated on five real-world fraud detection datasets, each spanning periods up to eight months and totaling more than 22 million online transactions. The system generated around 100 reports, and domain experts reported that those reports are useful and that the system was able to detect anomalous events in a model life cycle. Labels are not needed in order to detect concept drift when using the techniques disclosed.

FIG. 12 is a block diagram illustrating an embodiment of a system in which automatic model monitoring for data streams can be implemented. The system includes one or more nodes in a cluster 1240 that perform automatic model monitoring. The environment includes one or more transaction devices 1202, 1204, 1206, gateway 1210, network 1220, issuer 1230, and a cluster 1240 made up of one or more nodes 1242.1, 1242.2. Transaction devices 1202-1206 collect transaction data, and transmit the transaction data via gateway 1210 to issuer 1230. Issuer 1230 verifies the transaction data to determine whether to approve the transaction. For example, processing a transaction involving a purchase includes receiving account information (e.g., credit/debit) and transaction details (e.g., purchase amount) at a transaction device and determining whether to approve the transaction. An approved transaction may mean that payment by the account is accepted in exchange for goods or services. A denied transaction may mean that payment by the account is denied.

In some embodiments, whether to approve or deny a transaction can be based on an assessment of the likelihood that the transaction is fraudulent by monitoring data streams using the techniques disclosed herein. In some embodiments, cluster 1240 is configured to perform the techniques disclosed herein to detect anomalies and provide an indication (such as an alarm report) to issuer 1230 or a third party such as a merchant.

By way of non-limiting example, transaction data may include one or more of: time of transaction, account/payment information (such as a credit card account number, a debit account number, or a bank account wire number), amount paid, currency, transaction location, merchant name, merchant address, category code, city, state, zip, country, terminal identification, authentication type, and the like. In some embodiments, account data is generated by the transaction device by processing/filtering the account information. For example, an account number can be encrypted/hashed to protect the account number. A transaction device may be implemented by a terminal, a point of sale (POS) device, or any other device that accepts account information. For example, a terminal includes a credit card terminal that processes payment based on a received credit card account number. The transaction device may receive and parse account information using a variety of electronic techniques such as a chip reader, a magnetic stripe reader, barcode scanner, etc. In some embodiments, a transaction device is associated with a location and may be identified by its associated location. For example, a brick and mortar retailer (BM) having three checkout terminals (1-3) each equipped with one of the transaction devices 1202-1206 may be identified by transaction devices BM1, BM2, and BM3. As another example, a transaction device is a website processing payment for goods and services purchased over the Internet.
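By way of non-limiting illustration, the hashing of an account number at the transaction device can be sketched as follows. No particular algorithm is specified above; the sketch assumes a keyed hash (HMAC-SHA256 from the Python standard library), and the function name and the secret key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_account(account_number: str, secret_key: bytes) -> str:
    """Return a keyed hash of the account number (assumed scheme: HMAC-SHA256).

    Separators such as spaces or dashes are stripped first so that the same
    account always maps to the same token.
    """
    digits = "".join(ch for ch in account_number if ch in "0123456789")
    return hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and test card number, for illustration only.
key = b"demo-key-not-for-production"
token_a = pseudonymize_account("4111 1111 1111 1111", key)
token_b = pseudonymize_account("4111-1111-1111-1111", key)
```

A keyed hash is used in this sketch rather than a plain digest because account numbers have a small, structured value space and an unkeyed digest could be reversed by enumeration.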

A transaction location, which is typically associated with a transaction device, is a location where account information can be received to initiate a transaction. A transaction location may be a physical/geographical location, a location of a terminal, a Web location, and the like. Examples of transaction locations include checkout terminals, stores, a group of stores, or a system-wide (e.g., entire E-commerce merchant) location, and the like.

Misappropriated information (e.g., payment information) may be presented to a transaction device 1202-1206 for a purchase. If misappropriated information is used, then the transaction is fraudulent. During a transaction approval process or shortly after the transaction takes place, automatic model monitoring can be performed to identify and explain anomalous behavior. This signals that a transaction is potentially fraudulent. If applied during the transaction, a potentially fraudulent transaction may be prevented by declining the proffered payment method. If applied shortly after the transaction, the transaction may be reviewed and disapproved or the payment method may be declined for subsequent transactions. This avoids future exploits of the payment method. Automatic model monitoring may also be used after a decision to review, approve, or decline a transaction as well as to detect and explain anomalous behavior related to other issues such as system problems or unusual flows of transactions into the system.

A transaction identified to be a potentially fraudulent transaction can trigger remedial action such as verifying with an issuer bank or with the card holder whether the card was used without authorization. If so, then the potentially fraudulent transaction is confirmed to be actually fraudulent. The determination of potentially fraudulent transactions may be used to block a payment type associated with the potentially fraudulent transaction from being used in the future. An anticipated transaction (e.g., future location or time) can be determined/predicted, and preempted by declining the payment type.

Gateway 1210 receives transaction data from one or more transaction devices 1202-1206, routes the transaction data to network 1220, and returns an approval or decline notice based on the approval process of network 1220. Gateway 1210 may include a payment acquirer or Internet Service Provider. For example, the payment acquirer may be software hosted on a third-party server that handles transmissions between a merchant (represented by transaction devices 1202-1206) and an issuer 1230. In some embodiments, a gateway is associated with an acquiring bank (also referred to as a merchant bank). The acquiring bank is registered with a network 1220, wherein the network represents a card association or card scheme (e.g., Visa®, MasterCard®, American Express®, etc.). The acquiring bank contracts with merchants to create and maintain accounts allowing the merchant to accept accounts such as credit and debit cards. In some embodiments, gateway 1210 processes and encrypts the transaction data before routing the transaction data. In some embodiments, gateway 1210 groups one or more transactions together and sends the batch of transactions to issuer 1230 via network 1220.

Network 1220 is a platform for transmitting data between devices to support payment processing and electronic payments. In some embodiments, network 1220 is associated with a credit card association or card scheme (e.g., Visa®, MasterCard®, American Express®, etc.) and supports communications between association members such as an acquiring bank (e.g., gateway 1210) and an issuing bank (e.g., issuer 1230). In some embodiments, network 1220 implements a clearing house to provide clearing and settlement services. Network 1220 determines an appropriate destination to route the transaction data. For example, several issuer banks may be members of the network. The network determines the issuer corresponding to the transaction data and routes the transaction to the appropriate issuer. For simplicity, only one issuer 1230 is shown in FIG. 12. In some embodiments, network 1220 filters the received transaction data. For example, network 1220 may be aware of fraudulent accounts and determine whether the received transaction data includes a fraudulent account. Network 1220 may include one or more network connected servers for processing, routing, and/or facilitating transactions.

Issuer 1230 receives transaction data from network 1220 and determines whether to approve or deny a transaction (e.g., a provided account/payment). For example, issuer 1230 includes one or more servers/systems of an issuing bank. In some embodiments, the issuer is associated with an acquiring bank via network 1220. In some embodiments, determining whether to approve or deny an account/payment method includes determining whether the transaction is potentially fraudulent.

Automatic model monitoring is useful for, among other things, detecting anomalies in a data stream. The automatic model monitoring includes generating an explanation report, which can be used for a variety of purposes including but not limited to informing an administrator of a potential system issue, providing analytics to a data scientist, and determining whether to allow or deny a transaction. For example, a transaction attempted by an account identified as likely compromised is denied. As another example, transaction authorization is handled as follows. Previously identified fraudulent transactions are stored in storage 1244. When performing transaction authorization based on received transaction information, issuer 1230 accesses storage 1244 to determine whether the received transaction information is associated with a transaction device/location previously identified as a potentially fraudulent transaction stored in storage 1244. For example, if the transaction information is similar to a previously-identified potentially fraudulent transaction, the issuer denies the transaction.
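By way of non-limiting illustration, the authorization check against previously identified transactions can be sketched as follows. The measure of similarity is not defined above; the sketch assumes that a transaction is "similar" when it shares an account token or a transaction device identifier with a stored flagged transaction. The class and field names are hypothetical, and an in-memory set stands in for storage 1244.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    account_token: str  # e.g., a hashed account number
    device_id: str      # e.g., "BM1"
    amount: float


@dataclass
class FlaggedStore:
    """In-memory stand-in for the list of potentially fraudulent transactions."""
    devices: set = field(default_factory=set)
    accounts: set = field(default_factory=set)

    def flag(self, tx: Transaction) -> None:
        # Remember the device and account of a potentially fraudulent transaction.
        self.devices.add(tx.device_id)
        self.accounts.add(tx.account_token)

    def matches(self, tx: Transaction) -> bool:
        # Assumed similarity rule: same device or same account as a flagged one.
        return tx.device_id in self.devices or tx.account_token in self.accounts


def authorize(tx: Transaction, store: FlaggedStore) -> str:
    """Deny when the transaction matches a previously flagged one."""
    return "deny" if store.matches(tx) else "approve"


store = FlaggedStore()
store.flag(Transaction(account_token="acct-123", device_id="BM2", amount=950.0))
decision_same_device = authorize(Transaction("acct-456", "BM2", 20.0), store)
decision_unrelated = authorize(Transaction("acct-789", "BM3", 20.0), store)
```

An exact-match rule is the simplest possible choice; an embodiment could instead compare transactions on several fields or use a learned similarity.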

Storage 1244 stores information about transactions. Storage 1244 can be implemented by or include a variety of storage devices including devices for a memory hierarchy (cache, RAM, ROM, disk). In some embodiments, storage 1244 stores a list of potentially fraudulent transactions and/or a list of stolen/fraudulent accounts. The transaction information can be provided as a single transaction or a list of transactions. In some embodiments, a list of (past) transactions is stored in storage 1244 for a predetermined time, and is used to analyze subsequently-received transactions to provide output.

A payment verification process may take place within the environment shown in FIG. 12. In operation, a transaction device (1202, 1204, and/or 1206) receives transaction information such as account, time, amount, etc. as further described herein. In some embodiments, the transaction device processes the transaction information (e.g., packages the data). The transaction device sends the transaction data to gateway 1210. Gateway 1210 routes the received transaction data to network 1220. Network 1220 determines an issuer based on the transaction data, and sends the transaction data to the issuer. Issuer 1230 determines whether to approve or deny the transaction and detects system problems or unusual flows of transactions based on the transaction data and a security process performed by one or more nodes 1242.1, 1242.2. One or more nodes 1242.1, 1242.2 perform security processes to analyze the received transaction data and identify anomalies. The processes shown in FIGS. 2, 5, 9 are examples of security processes performed by cluster 1240.

Network 1220 and gateway 1210 relay an approval or decline notice back to the transaction device. If the transaction is approved, payment has been accepted and the transaction is successful. If the transaction is declined, payment has not been accepted and the transaction is unsuccessful.

In some embodiments, nodes of cluster 1240 are controlled and managed by issuer 1230. For example, devices/systems of the issuer or payment processing network retain transaction information and perform analysis to identify potentially fraudulent transactions. For example, the one or more nodes may be provided within the computing environment of issuer 1230. In some embodiments, nodes of cluster 1240 are controlled and managed by a third party. For example, issuer 1230 has contracted with the third party to perform analysis using data provided to the issuer (e.g., transaction information) to identify potentially fraudulent transactions for the issuer. One or more nodes of cluster 1240 perform the processes described herein, e.g., the processes shown in FIGS. 2, 5, 6, 9, 10.

FIG. 13 is a functional diagram illustrating a programmed computer system for automatic model monitoring in accordance with some embodiments. As will be apparent, other computer system architectures and configurations can be used to perform automatic model monitoring. Computer system 1300, which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 1302. For example, processor 1302 can be implemented by a single-chip processor or by multiple processors. In some embodiments, processor 1302 is a general purpose digital processor that controls the operation of the computer system 1300. Using instructions retrieved from memory 1380, the processor 1302 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 1318). In some embodiments, processor 1302 includes and/or is used to provide nodes 1242.1 or 1242.2 or cluster 1240 in FIG. 12 and/or executes/performs the processes described above with respect to FIGS. 2, 5, 6, 9, 10.

Processor 1302 is coupled bi-directionally with memory 1380, which can include a first primary storage area, typically a random access memory (RAM), and a second primary storage area, typically a read-only memory (ROM). As is well known in the art, primary storage can be used as a general storage area and as scratch-pad memory, and can also be used to store input data and processed data. Primary storage can also store programming instructions and data, in the form of data objects and text objects, in addition to other data and instructions for processes operating on processor 1302. Also as is well known in the art, primary storage typically includes basic operating instructions, program code, data, and objects used by the processor 1302 to perform its functions (e.g., programmed instructions). For example, memory 1380 can include any suitable computer-readable storage media, described below, depending on whether, for example, data access needs to be bi-directional or uni-directional. For example, processor 1302 can also directly and very rapidly retrieve and store frequently needed data in a cache memory (not shown).

A removable mass storage device 1312 provides additional data storage capacity for the computer system 1300, and is coupled either bi-directionally (read/write) or uni-directionally (read only) to processor 1302. For example, storage 1312 can also include computer-readable media such as magnetic tape, flash memory, PC-CARDS, portable mass storage devices, holographic storage devices, and other storage devices. A fixed mass storage 1320 can also, for example, provide additional data storage capacity. The most common example of mass storage 1320 is a hard disk drive. Mass storage 1312, 1320 generally store additional programming instructions, data, and the like that typically are not in active use by the processor 1302. It will be appreciated that the information retained within mass storage 1312 and 1320 can be incorporated, if needed, in standard fashion as part of memory 1380 (e.g., RAM) as virtual memory.

In addition to providing processor 1302 access to storage subsystems, bus 1314 can also be used to provide access to other subsystems and devices. As shown, these can include a display monitor 1318, a network interface 1316, a keyboard 1304, and a pointing device 1306, as well as an auxiliary input/output device interface, a sound card, speakers, and other subsystems as needed. For example, the pointing device 1306 can be a mouse, stylus, track ball, or tablet, and is useful for interacting with a graphical user interface.

The network interface 1316 allows processor 1302 to be coupled to another computer, computer network, or telecommunications network using a network connection as shown. For example, through the network interface 1316, the processor 1302 can receive information (e.g., data objects or program instructions) from another network or output information to another network in the course of performing method/process steps. Information, often represented as a sequence of instructions to be executed on a processor, can be received from and outputted to another network. An interface card or similar device and appropriate software implemented by (e.g., executed/performed on) processor 1302 can be used to connect the computer system 1300 to an external network and transfer data according to standard protocols. For example, various process embodiments disclosed herein can be executed on processor 1302, or can be performed across a network such as the Internet, intranet networks, or local area networks, in conjunction with a remote processor that shares a portion of the processing. Additional mass storage devices (not shown) can also be connected to processor 1302 through network interface 1316.

An auxiliary I/O device interface (not shown) can be used in conjunction with computer system 1300. The auxiliary I/O device interface can include general and customized interfaces that allow the processor 1302 to send and, more typically, receive data from other devices such as microphones, touch-sensitive displays, transducer card readers, tape readers, voice or handwriting recognizers, biometrics readers, cameras, portable mass storage devices, and other computers.

In addition, various embodiments disclosed herein further relate to computer storage products with a computer readable medium that includes program code for performing various computer-implemented operations. The computer-readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer-readable media include, but are not limited to, all the media mentioned above: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media such as optical disks; and specially configured hardware devices such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs), and ROM and RAM devices. Examples of program code include both machine code, as produced, for example, by a compiler, or files containing higher level code (e.g., script) that can be executed using an interpreter.

The computer system shown in FIG. 13 is but an example of a computer system suitable for use with the various embodiments disclosed herein. Other computer systems suitable for such use can include additional or fewer subsystems. In addition, bus 1314 is illustrative of any interconnection scheme serving to link the subsystems. Other computer architectures having different configurations of subsystems can also be utilized.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive. 

What is claimed is:
 1. A method comprising: obtaining model scores from a first machine learning model; training a second machine learning model to learn how to differentiate between two groups based on at least one of: features and the model scores obtained from the first machine learning model; applying the second machine learning model to each data record in a first group of data records to determine a corresponding ranking score for each data record in the first group; and based on the corresponding ranking scores, determining a relative contribution of each of the data records in the first group to the differentiation between the first group of data records and a second group of data records.
 2. The method of claim 1, wherein at least a portion of the features that are available for the data records of both groups are used.
 3. The method of claim 1, wherein the first group includes data records in a target window and the second group includes data records in a reference window.
 4. The method of claim 1, further comprising removing index-correlated features prior to training the second machine learning model.
 5. The method of claim 1, further comprising removing time-correlated features prior to training the second machine learning model.
 6. The method of claim 5, wherein removing time-correlated features prior to training the second machine learning model includes: obtaining a data series associated with a distribution of values that generated the data records; shuffling the data series randomly a predetermined number of times; calculating corresponding values of a measure of correlation for each shuffle; selecting a maximum observed value among the shuffles to be a threshold; determining a value for the measure of correlation without shuffling; and removing a feature if the value for the measure of correlation without shuffling of the feature is larger than the threshold.
 7. The method of claim 6, wherein the measure of correlation is sensitive to non-linear relations.
 8. The method of claim 6, wherein the measure of correlation includes a Maximal Information Coefficient (MIC).
 9. The method of claim 6, wherein shuffling the data series randomly a predetermined number of times includes choosing the predetermined number of times to ensure a statistical confidence above a threshold.
 10. The method of claim 1, wherein the second machine learning model includes a measure of feature importance for correlated features.
 11. The method of claim 10, wherein the second machine learning model is a Gradient Boosted Decision Trees (GBDT) model.
 12. The method of claim 1, further comprising outputting an explanation report in response to an anomaly in data records of at least one of the first group and the second group.
 13. The method of claim 1, further comprising outputting an explanation report including window start and end timestamps.
 14. The method of claim 1, further comprising outputting an explanation report including a feature importance ranking list based at least in part on the ranking scores.
 15. The method of claim 1, further comprising outputting an explanation report including a list of a predetermined number of top data records.
 16. The method of claim 15, wherein the list of a predetermined number of top data records includes feature values used by the second machine learning model.
 17. The method of claim 1, further comprising outputting an explanation report including a validation curve to show how well a ranking of the data records can lower a monitoring value.
 18. The method of claim 17, wherein the validation curve includes values using a target window with a predetermined number of top events removed.
 19. A system comprising: a processor configured to: obtain model scores from a first machine learning model; train a second machine learning model to learn how to differentiate between two groups based on at least one of: features and the model scores obtained from the first machine learning model; apply the second machine learning model to each data record in a first group of data records to determine a corresponding ranking score for each data record in the first group; and based on the corresponding ranking scores, determine a relative contribution of each of the data records in the first group to the differentiation between the first group of data records and a second group of data records; and a memory coupled to the processor and configured to provide the processor with instructions.
 20. A computer program product embodied in a non-transitory computer readable storage medium and comprising computer instructions for: obtaining model scores from a first machine learning model; training a second machine learning model to learn how to differentiate between two groups based on at least one of: features and the model scores obtained from the first machine learning model; applying the second machine learning model to each data record in a first group of data records to determine a corresponding ranking score for each data record in the first group; and based on the corresponding ranking scores, determining a relative contribution of each of the data records in the first group to the differentiation between the first group of data records and a second group of data records.